SPARSE: a sparse hypergraph neural network for learning multiple types of latent combinations to accurately predict drug–drug interactions

Abstract Motivation Predicting side effects of drug–drug interactions (DDIs) is an important task in pharmacology. The state-of-the-art methods for DDI prediction use hypergraph neural networks to learn latent representations of drugs and side effects to express high-order relationships among two interacting drugs and a side effect. The idea of these methods is that each side effect is caused by a unique combination of latent features of the corresponding interacting drugs. However, in reality, a side effect might have multiple, different mechanisms that cannot be represented by a single combination of latent features of drugs. Moreover, DDI data are sparse, suggesting that using a sparsity regularization would help to learn better latent representations to improve prediction performances. Results We propose SPARSE, which encodes the DDI hypergraph and drug features to latent spaces to learn multiple types of combinations of latent features of drugs and side effects, controlling the model sparsity by a sparse prior. Our extensive experiments using both synthetic and three real-world DDI datasets showed the clear predictive performance advantage of SPARSE over cutting-edge competing methods. Also, latent feature analysis over unknown top predictions by SPARSE demonstrated the interpretability advantage contributed by the model sparsity. Availability and implementation Code and data can be accessed at https://github.com/anhnda/SPARSE. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
A drug-drug interaction (DDI) is a reaction between two drugs, whereby the effects of one drug are modified by the concomitant use of the second drug. A DDI might cause side effects, which are unwanted effects and are responsible for significant patient morbidity and mortality (Magro et al., 2012). Hence, predicting side effects of a DDI, i.e. DDI prediction, is a very important task to guarantee drug safety.
Using machine learning has emerged as a prominent approach for DDI prediction, making the prediction fast and highly accurate (Xu et al., 2019;Zitnik et al., 2018). The traditional machine learning methods such as support-vector machines (Kastrin et al., 2018), logistic regression (Mei and Zhang, 2021) or feedforward neural networks (Wang et al., 2019) use predefined drug features to predict side effects as labels. However, DDI data have more information. Particularly, DDIs can be represented by a graph, called a DDI graph, where nodes are drugs and edges are interacting drugs. The DDI graph can be learned with graph neural networks (Zitnik et al., 2018). Nonetheless, DDI graphs are only limited to pairwise relationships of drug pairs while there still exist many side effects, which can be represented by other relationships, such as co-occurrence. Then, a state-of-the-art generalization of a DDI graph can be a DDI hypergraph, which can capture higher-order relationships, where drugs and side effects are both nodes, and each hyperedge is a triple of a side effect with two interacting drugs.
On the DDI hypergraph, hypergraph neural networks can be applied to learn the representations of drugs and side effects altogether. In DDIs, two drugs with totally different properties can still interact with each other, hence the traditional hypergraph neural networks using similarity assumption on node representations are not suitable (Feng et al., 2019). Instead, CentSmoothie, a current cutting-edge hypergraph neural network for DDIs (Nguyen et al., 2021), assumes that each side effect is caused by a unique combination of latent features of the corresponding interacting drugs. However, in real life, each side effect might have many different mechanisms (Suleyman et al., 2010) that cannot be reflected in a single combination of drug latent features. Hence, it is necessary to learn different types of combinations of drug latent features for each side effect. This is the first problem (P1), which we address in this article.
To solve P1, we borrow one idea of stochastic block models (SBMs) on hypergraphs such that each node (e.g. drug or side effect) has one or several latent features (Anandkumar et al., 2013;Pal and Zhu, 2021) and there exist interactions (associations) of latent features. This method can learn different types of combinations of drug latent features for each side effect, at once. In addition, to improve the quality of learned latent features, input node features also can be used (Zhang et al., 2019). However, transformations from input node features and node relationships in the hypergraphs to latent features might be complex and, especially, non-linear. This is the second problem (P2), which has not been addressed in existing SBMs and we address in this article.
Moreover, DDI data are sparse (e.g. in the largest DDI dataset, 97.6% of all triples of drug-drug-side effects are not a DDI), suggesting that the model for learning DDIs also should be sparse. However, recent work on DDIs has not used this sparsity of the data (Nguyen et al., 2021;Zitnik et al., 2018), which might potentially impair model performance. This is the third problem (P3), which we address in this article.
We propose SPARSE, a new model for DDI prediction, to solve the above three problems. For P1, we assume that there exist drug and side effect latent features with latent interactions so that each side effect latent feature interacts with several pairs of drug latent features. For P2, we encode drug features and the DDI hypergraph altogether in the latent representations using a suitable hypergraph neural network. For P3, we guide the model to preserve the sparsity of the data using a suitable sparsity control. Figure 1 schematically illustrates these ideas of our model. That is, the model consists of two parts: (i) an encoder and (ii) a decoder. The encoder encodes the input of the DDI hypergraph (e.g. three hyperedges in Fig. 1) with drug features into latent spaces of drug and side effect latent representations, and interactions of latent features. The decoder reconstructs from the latent spaces the DDI hypergraph with new DDI predictions (e.g. the dotted hyperedge in Fig. 1). Finally, a sparsity prior (horseshoe priors in our model) is used to control the sparsity of the latent interactions.
Our extensive experiments first validated the advantage of SPARSE in terms of prediction performance by using both synthetic and real-world datasets. Throughout all experiments on prediction performance, SPARSE achieved better prediction performances than competing methods, such as CentSmoothie and SBM. For example, on the largest real DDI dataset, TWOSIDES, SPARSE achieved an area under the ROC curve (AUC) of 0.9524 and an area under the precision-recall curve (AUPR) of 0.882, while CentSmoothie achieved an AUC of 0.9348 and an AUPR of 0.8749, and SBM achieved an AUC of 0.9337 and an AUPR of 0.8583. Similarly, on JADERDDI, another DDI dataset, SPARSE achieved an AUC of 0.9698 and an AUPR of 0.7348, while CentSmoothie achieved an AUC of 0.9684 and an AUPR of 0.6044, and SBM achieved an AUC of 0.9428 and an AUPR of 0.5963.
We then examined the top predictions obtained by SPARSE trained on the whole TWOSIDES dataset. That is, we counted the overlaps between the top 400 predictions of each method and the DDIs in drugs.com (Drugs.com, 2021; Thelwall et al., 2017), a commonly used online checker for DDIs. SPARSE found 98 DDIs in drugs.com among its top 400 predictions, while, under the same procedure, CentSmoothie found only 71, implying that SPARSE can find more new DDIs than competing methods.
Finally, we validated the prediction results by characterizing the top predictions obtained by SPARSE. In more detail, we checked the biological properties, such as target proteins, of the top 10 triples of drug-drug-side effect, predicted by SPARSE, by using latent features connected to these top 10 predictions. We then found that top predictions can be associated with some biological mechanisms and particularly with responsible proteins/pathways. These results indicate that our model, SPARSE, can provide high predictive performances as well as latent biological knowledge beneficial to understand the background behind predicted DDIs.

Related work
Machine-learning models for DDI prediction can be divided into non-graph-based and graph-based ones. For non-graph-based models, the inputs are the predefined feature vectors of pairs of drugs, the outputs are the corresponding side effects, and the models are multi-label classifiers, e.g. support-vector machines (Kastrin et al., 2018) or a multilayer feedforward neural network (Wang et al., 2019). Instead of only using predefined drug feature vectors, graph-based methods for DDI use graph neural networks to learn new latent representations of drugs from molecular graphs or DDI graphs. In molecular graphs, each drug is considered as a graph in which nodes are atoms and edges are connections between atoms (Harada et al., 2020; Xu et al., 2019). In DDI graphs, DDIs are considered as pairwise relationships and formulated as a graph where nodes are drugs and edges are drug interactions with side effects as labels (Zitnik et al., 2018). The latter has been shown to be more effective for DDI prediction since it can use both pharmacological and biological information rather than only molecular graphs (Zitnik et al., 2018).
However, one drawback of using graph neural networks on DDI graphs is that they do not use multiple relationships (labels) at the same time. Side effects themselves have relationships with each other, e.g. co-occurrences. Existing work often fixes them as one-hot vectors that indicate the presence of the side effects. This representation considers side effects independently, potentially making the models under-utilize the side effect relationships.
Hypergraph neural networks on DDI overcome the above drawback by learning representations of drug and side effect nodes altogether in latent spaces (Nguyen et al., 2021). DDIs are considered as high-order relationships of drug-drug-side effects in the form of a hypergraph where nodes are both drugs and side effects, and each hyperedge is a triple of two interacting drugs and a side effect caused by the drugs. There are two types of hypergraph neural network models for DDI: similarity based and non-similarity based. The similarity-based models, e.g. traditional spectral-based hypergraph neural networks, assume that interacting drugs should have similar representations (Fan et al., 2021; Feng et al., 2019). However, in DDI, two interacting drugs are not necessarily similar. Among non-similarity models, the current state-of-the-art method is CentSmoothie (Nguyen et al., 2021), which assumes that the representation of a side effect can be represented by a combination of latent features of the two drugs causing the side effect. However, CentSmoothie cannot deal with multiple combinations of latent features at the same time.
In order to deal with multiple combinations of latent features, one possible approach is to use the idea of SBMs, which can be applied to hypergraphs, with each node belonging to several latent features (groups) and associations of latent features (groups) (Anandkumar et al., 2013). However this has not been applied to DDI hypergraphs, and more importantly, SBM is based on linear assumption, while DDI can be generated through more complex relations to be represented by non-linearity.
Many studies have shown the benefits of sparsity regularization, which is a commonly used method to achieve sparsity of models, especially on noisy and sparse data (Carvalho et al., 2009; Tibshirani, 1996). From a Bayesian viewpoint, sparsity regularization can be understood as the result of using sparse prior distributions. A state-of-the-art method for sparsity regularization is to use horseshoe priors (Carvalho et al., 2009; Piironen and Vehtari, 2017). The horseshoe prior has an advantage over the traditional Laplace prior (Lasso regularization) (Tibshirani, 1996) in that it allows shrinkage in both directions: no shrinkage for important features and complete shrinkage for non-important (noise) features. A shrinkage prior comparable to the horseshoe prior is the spike-and-slab prior (Hoeting et al., 1999). However, the spike-and-slab prior is a discrete prior that requires Markov chain Monte Carlo sampling for optimization, which is not effective for large-scale datasets like DDI.

Background
We recall definitions for horseshoe priors and n-mode tensor product for 3D tensors, which will be used later.

Horseshoe priors
We summarize the horseshoe prior (Carvalho et al., 2009), a state-of-the-art prior for sparsity control, for a non-negative 3D tensor $B \in \mathbb{R}_{0+}^{K_1 \times K_2 \times K_3}$. The idea of the horseshoe prior is that each $B_{i,j,k}$ follows a normal distribution with the same zero mean and a different variance. Each variance has two parts: a global parameter, shared among all variances, that decides the sparsity of $B$, and a local parameter that decides the magnitude of each variance and follows a heavy-tailed half-Cauchy distribution. In more detail:

$$B_{i,j,k} \mid \Lambda_{i,j,k}, \tau \sim \mathcal{N}\left(0, \Lambda_{i,j,k}^2 \tau^2\right), \qquad \Lambda_{i,j,k} \sim C^{+}(0, 1),$$

where $\tau$ is a global parameter for sparsity, and $C^{+}(0, 1)$ is a half-Cauchy distribution defined by:

$$p(\Lambda_{i,j,k}) = \frac{2}{\pi \left(1 + \Lambda_{i,j,k}^2\right)}, \qquad \Lambda_{i,j,k} \geq 0.$$

Both the horseshoe prior and the Laplace prior (for Lasso regularization) are shrinkage priors: under these priors, the values of features tend to be shrunk (Piironen and Vehtari, 2017). Let $\bar{B}_{i,j,k}$ be the optimal values without priors; then the optimal values with priors have the form $B_{i,j,k} = (1 - \kappa_{i,j,k}) \bar{B}_{i,j,k}$, where $0 \leq \kappa_{i,j,k} \leq 1$ is a shrinkage factor depending on the prior. With the Laplace prior (Lasso regularization), the density of $\kappa_{i,j,k}$ tends to be a constant near 1 and disappears near 0, meaning that it always shrinks all features, including important ones. In contrast, the density of $\kappa_{i,j,k}$ with the horseshoe prior has two peaks at 0 and 1, meaning that the horseshoe prior allows two kinds of shrinkage: no shrinkage to maintain important features and complete shrinkage to remove unimportant features.
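The two-peak shape of the shrinkage factor can be illustrated numerically. The sketch below is only an illustration under a simplifying assumption that is not part of the model description: the global parameter and the noise variance are both set to 1, in which case the shrinkage factor reduces to 1 / (1 + lambda^2) for a half-Cauchy local parameter lambda. The function names are hypothetical.

```python
import math
import random

def sample_shrinkage_factors(n, seed=0):
    """Draw n shrinkage factors kappa = 1 / (1 + lambda^2), lambda ~ C+(0, 1).

    Assumption (illustration only): global parameter and noise variance are 1.
    """
    rng = random.Random(seed)
    kappas = []
    for _ in range(n):
        # If theta is uniform on (0, pi/2), tan(theta) is half-Cauchy(0, 1).
        theta = rng.uniform(0.0, math.pi / 2)
        lam = math.tan(theta)
        kappas.append(1.0 / (1.0 + lam * lam))
    return kappas

def histogram(values, bins=10):
    """Fraction of values in each of `bins` equal-width bins on [0, 1]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return [c / len(values) for c in counts]

kappas = sample_shrinkage_factors(20000)
fractions = histogram(kappas)
for b, frac in enumerate(fractions):
    print(f"kappa in [{b / 10:.1f}, {(b + 1) / 10:.1f}): {frac:.3f}")
```

In this simplified setting the printed fractions are largest in the first and last bins, which matches the "no shrinkage or complete shrinkage" behavior described above.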

N-mode tensor product
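The n-mode tensor product can be understood as a generalization of the matrix product to higher dimensions, in which the product is taken along the nth dimension. Considering the 3D case with a tensor $B \in \mathbb{R}^{K_1 \times K_2 \times K_3}$ and a matrix $H \in \mathbb{R}^{T \times K_n}$, $n \in \{1, 2, 3\}$, the n-mode product of $B$ and $H$ is denoted by $B \times_n H$ and is defined for each of n = 1, 2 and 3 as follows:

$$(B \times_1 H)_{t,j,k} = \sum_{i=1}^{K_1} B_{i,j,k} H_{t,i}, \qquad (B \times_2 H)_{i,t,k} = \sum_{j=1}^{K_2} B_{i,j,k} H_{t,j}, \qquad (B \times_3 H)_{i,j,t} = \sum_{k=1}^{K_3} B_{i,j,k} H_{t,k}.$$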
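As a concrete illustration of the definition, the following minimal sketch computes the n-mode product on nested Python lists. It follows the standard definition of the n-mode product; the function name `mode_n_product` is hypothetical, and a practical implementation would normally use a tensor library instead.

```python
def mode_n_product(B, H, n):
    """n-mode product of a 3D tensor B (nested lists, K1 x K2 x K3)
    with a matrix H (T x K_n), where n is 1, 2 or 3."""
    K1, K2, K3 = len(B), len(B[0]), len(B[0][0])
    T = len(H)
    if len(H[0]) != (K1, K2, K3)[n - 1]:
        raise ValueError("H must have K_n columns")
    if n == 1:  # result has shape T x K2 x K3
        return [[[sum(B[i][j][k] * H[t][i] for i in range(K1))
                  for k in range(K3)] for j in range(K2)] for t in range(T)]
    if n == 2:  # result has shape K1 x T x K3
        return [[[sum(B[i][j][k] * H[t][j] for j in range(K2))
                  for k in range(K3)] for t in range(T)] for i in range(K1)]
    if n == 3:  # result has shape K1 x K2 x T
        return [[[sum(B[i][j][k] * H[t][k] for k in range(K3))
                  for t in range(T)] for j in range(K2)] for i in range(K1)]
    raise ValueError("n must be 1, 2 or 3")

B_example = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # 2 x 2 x 2 tensor
summed = mode_n_product(B_example, [[1, 1]], 1)    # sums over the first mode
first_slice = mode_n_product(B_example, [[1, 0]], 3)  # selects k = 1
```

When H is a single row vector (T = 1), applying the product along all three modes in turn collapses the tensor to a scalar, which is the form used for interaction scores.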

Problem formulation: DDI prediction
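We formulate the DDI prediction problem as follows. Input: Given a DDI hypergraph $G = (V, E)$, where the node set $V = V_D \cup V_S$ consists of drugs $V_D$ and side effects $V_S$, and each hyperedge in $E \subseteq V_D \times V_D \times V_S$ is a triple of two interacting drugs and a side effect.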

Proposed model
We propose SPARSE: a sparse model for learning multiple types of latent combinations of side effects and drugs to predict DDIs. Our model follows an auto-encoder framework with two parts: an encoder and a decoder. The encoder encodes the DDI hypergraph with drug node features to latent spaces with latent representations of drugs and side effects (H), and interactions of latent features (B). The decoder aims to reconstruct the DDI hypergraph with new predicted hyperedges from H and B. In the following parts, we first present our latent interaction assumption with sparsity for the interactions of drugs and side effects, and then we describe the encoder and decoder.

Latent interaction assumption
To model DDIs, we suppose that there exist latent spaces with drug latent features and side effect latent features where DDIs occur. The latent interaction assumption is that two interacting drugs cause a side effect if there exist a pair of drug latent features of the two drugs that interact with a latent feature of the side effect.
In detail, the latent interaction assumption is formulated as follows. Let $L_D = \{1, \dots, K_D\}$ and $L_S = \{1, \dots, K_S\}$ be the sets of indices of latent features of drugs and side effects, where $K_D$ and $K_S$ are the numbers of latent features. Let $B \in \mathbb{R}_{0+}^{K_D \times K_D \times K_S}$ be a 3D tensor representing interactions of latent features of drugs and side effects. The set of interacting latent features is:

$$A = \{(i, j, k) \in L_D \times L_D \times L_S \mid B_{i,j,k} > 0\}.$$

Considering a triple of two drugs and one side effect $(u, v, t) \in V_D \times V_D \times V_S$, let $h_d(u), h_d(v) \in \mathbb{R}_{0+}^{K_D}$ and $h_s(t) \in \mathbb{R}_{0+}^{K_S}$ be the vectors representing the presence of latent features of the two drugs and the side effect, respectively. Let $g_u = \{i \in L_D \mid h_d(u)_i > 0\}$, $g_v = \{i \in L_D \mid h_d(v)_i > 0\}$ and $g_t = \{i \in L_S \mid h_s(t)_i > 0\}$ be the sets of latent features of u, v and t, respectively.
Under the latent interaction assumption, u interacts with v to cause t if:

$$(g_u \times g_v \times g_t) \cap A \neq \emptyset, \tag{6}$$

or, with the tensor product formulation:

$$B \times_1 h_d(u) \times_2 h_d(v) \times_3 h_s(t) > 0. \tag{7}$$

In practice, we can change the value 0 on the right side of Equation (7) to a positive threshold. Equation (6) will be used to generate synthetic data in the experimental section. Equation (7) will be used in the decoder of the model.
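The two forms of the assumption, the set form of Equation (6) and the tensor form of Equation (7), agree whenever B and the latent vectors are non-negative. The sketch below checks this on a toy example; the function names and the toy values are hypothetical and only serve as an illustration.

```python
def support(h):
    """Indices of the strictly positive entries of a latent vector."""
    return {i for i, x in enumerate(h) if x > 0}

def interacts_set_form(B, hu, hv, ht):
    """Set form: some (i, j, k) in g_u x g_v x g_t has B[i][j][k] > 0."""
    return any(B[i][j][k] > 0
               for i in support(hu) for j in support(hv) for k in support(ht))

def interaction_score(B, hu, hv, ht):
    """Tensor form: B x_1 h_d(u) x_2 h_d(v) x_3 h_s(t), a scalar."""
    return sum(B[i][j][k] * hu[i] * hv[j] * ht[k]
               for i in range(len(hu))
               for j in range(len(hv))
               for k in range(len(ht)))

# Toy tensor with K_D = 3, K_S = 2 and a single latent interaction (1, 3, 2),
# stored with zero-based indices as B[0][2][1].
B_toy = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
B_toy[0][2][1] = 0.5

hit = interaction_score(B_toy, [1, 0, 0], [0, 0, 2], [0, 1])
miss = interaction_score(B_toy, [1, 0, 0], [0, 1, 0], [0, 1])
```

The score is positive exactly when the set-form test succeeds, so thresholding the score at 0 (or at a small positive value) recovers the interaction decision.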

Sparsity property
We first define sparsity measures of the DDI data and of the latent interactions as the percentages of non-interactions. Let $s_d$ be the sparsity of the hypergraph G:

$$s_d = 1 - \frac{|E|}{|V_D|^2 |V_S|}.$$

The sparsity of the latent interactions $s_l$ is defined as the percentage of non-interacting triples of latent features among all triples of latent features:

$$s_l = 1 - \frac{|B|_0}{K_D^2 K_S},$$

where $|B|_0$ is the number of non-zero entries of B.
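Both measures are simple to compute. The sketch below assumes that all ordered triples in V_D x V_D x V_S are counted in the denominator of the data sparsity; the function names are hypothetical.

```python
def data_sparsity(hyperedges, num_drugs, num_side_effects):
    """Fraction of (drug, drug, side effect) triples that are not hyperedges.

    Assumption: the denominator counts all ordered triples.
    """
    distinct = set(hyperedges)
    return 1.0 - len(distinct) / (num_drugs ** 2 * num_side_effects)

def latent_sparsity(B):
    """Fraction of zero entries in a 3D tensor given as nested lists."""
    entries = [x for plane in B for row in plane for x in row]
    return sum(1 for x in entries if x == 0) / len(entries)

s_d_example = data_sparsity([(0, 1, 0), (1, 0, 3)], num_drugs=2, num_side_effects=5)
s_l_example = latent_sparsity([[[0, 0], [0, 2.5]], [[0, 0], [0, 0]]])
```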
DDI data are sparse as per statistics in Table 1. It is shown that 97.6% and 99.87% of all triples are non-interacting in TWOSIDES and JADERDDI, respectively.
The motivation for us to use sparse models is that sparse models, according to statistical learning theory, are usually more reliable models if they could fit the training data well (Hastie, 2015). As our sparse models have sparse interactions among latent features, we will prove that they tend to generate sparse data and are suitable for DDI data. We show a relationship between sparsity of the models and sparsity of data generated by the models, which are the ones that best fit the models, as follows.
Property 1: Assume that the DDI data are generated from the true generation model according to Equation (7), and that each drug and each side effect has exactly $n_u$ and $n_t$ non-zero latent features, respectively. Then, there exists a relationship between the sparsity of the model and the expected sparsity of the generated data, as derived in the proof below.
Proof: For a pair of drugs u, v to cause side effect t, we need $B \times_1 h_d(u) \times_2 h_d(v) \times_3 h_s(t) > 0$. This means that there is at least one non-zero entry of B corresponding to latent features of u, v and t. Since there are exactly $n_u^2 n_t$ possible entries of B corresponding to latent features of u, v and t, the probability that a uniformly sampled entry of B corresponds to these latent features is $p_1 = \frac{n_u^2 n_t}{K_D^2 K_S}$. This is the probability of having an interaction among the features (that generates a side effect data point).
Since entries of B are assumed to be randomly sampled according to a uniform distribution, the number of interactions when B has $|B|_0 = (1 - s_l) K_D^2 K_S$ non-zero entries follows a binomial distribution $\mathrm{Binomial}(|B|_0, p_1)$.
With the assumption that the hypergraph is generated from this generative process, the expected number of non-zero data points (the number of hyperedges) becomes $|B|_0 \cdot p_1 = (1 - s_l) \cdot n_u^2 n_t$. The resulting expected sparsity of the hypergraph leads to $E(s_d) > s_l p_1$. It shows a relationship between the sparsity of the model ($s_l$) and the expected sparsity of the data generated by the model ($E(s_d)$). It shows that the model can be sparse but cannot be as sparse as we want. It can be a hint on setting the sparsity of the model in learning processes.
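The expected count used in the proof can be checked by simulation. The sketch below is a Monte Carlo illustration under assumptions of our own choosing (the toy sizes, the fixed number of non-zero entries placed uniformly at random, and the function name are all hypothetical): it averages the number of non-zero entries of B that fall in the block of latent features of a random triple and compares it with (1 - s_l) n_u^2 n_t.

```python
import random

def expected_block_count(K_D, K_S, s_l, n_u, n_t, trials=2000, seed=0):
    """Average number of non-zero entries of B inside g_u x g_v x g_t."""
    rng = random.Random(seed)
    cells = [(i, j, k) for i in range(K_D) for j in range(K_D) for k in range(K_S)]
    num_nonzero = round((1 - s_l) * len(cells))  # |B|_0
    total = 0
    for _ in range(trials):
        # Place the non-zero entries of B uniformly at random.
        nonzero = set(rng.sample(cells, num_nonzero))
        # Each drug has exactly n_u latent features, each side effect n_t.
        g_u = rng.sample(range(K_D), n_u)
        g_v = rng.sample(range(K_D), n_u)
        g_t = rng.sample(range(K_S), n_t)
        total += sum((i, j, k) in nonzero for i in g_u for j in g_v for k in g_t)
    return total / trials

estimate = expected_block_count(K_D=10, K_S=8, s_l=0.9, n_u=3, n_t=2)
theory = (1 - 0.9) * 3 ** 2 * 2
print(f"simulated {estimate:.3f} vs (1 - s_l) * n_u^2 * n_t = {theory:.3f}")
```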

Encoder
For the encoder, we use a hypergraph neural network with message passing (Yadati, 2020) to encode the input hypergraph and node features into latent spaces with node latent representations H and latent interactions B (for simplicity, B can be considered as a free parameter to learn).
$$H = (H_d, H_s) = g_{w_0}(G, F), \tag{11}$$

$$B = f_{w_1}(G, F), \tag{12}$$

where F denotes the drug feature vectors, $g_{w_0}$ and $f_{w_1}$ are hypergraph neural networks based on message passing (Yadati, 2020) with parameters to learn $w_0$, $w_1$, $H_d = \{h_d(u) \in \mathbb{R}_{0+}^{K_D} \mid u \in V_D\}$ (node representations of drugs) and $H_s = \{h_s(t) \in \mathbb{R}_{0+}^{K_S} \mid t \in V_S\}$ (node representations of side effects). The formulation of each message passing layer has the following form:

$$h^{(l+1)}(a) = \sigma\left(T\left(\{M^{(l)}(e, a) \mid e \in N_a\}\right)\right), \tag{13}$$

where $h^{(l)}(a)$ is the representation of node $a \in V_D \cup V_S$ at layer (l), $\sigma$ is an activation function, T is an aggregation function (e.g. an average function), $N_a = \{e \in E \mid a \in e\}$, and $M^{(l)}$ is a message passing function at layer (l) that passes information from the neighbor nodes b in hyperedge e to a. The message passing function is implemented with a two-layer feedforward neural network and uses the node types $c(b) = 1$ if $b \in V_D$ and $c(b) = -1$ if $b \in V_S$.

Decoder
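The hypergraph is reconstructed according to the latent interaction assumption. The likelihood of reconstructing each triple $e = (u, v, t) \in V_D \times V_D \times V_S$ follows a Gaussian distribution of the indicator $i(e)$ centered at $m_{w_0,w_1}(e)$, where $i(e) = 1$ if $e \in E$, $i(e) = 0$ if $e \in \bar{E} = V_D \times V_D \times V_S \setminus E$, and $m_{w_0,w_1}(e)$ is the mean value for the latent interaction of e:

$$m_{w_0,w_1}(e) = B \times_1 h_d(u) \times_2 h_d(v) \times_3 h_s(t). \tag{17}$$

Equation (17) is also the score for the interactions of triples (u, v, t) used for prediction. The likelihood for the decoder is:

$$p(G \mid B, H) = \prod_{e \in V_D \times V_D \times V_S} p(e \mid B, H). \tag{18}$$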

Objective function
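The objective of our method is to maximize the posterior of the model. The objective function consists of two parts: one for the log-likelihood of the model and one for the prior for sparsity control. Let $\Lambda \in \mathbb{R}_{0+}^{K_D \times K_D \times K_S}$ be the horseshoe prior parameter for B and $\tau$ be the hyperparameter for the global sparsity of the horseshoe prior. We have the following objective function:

$$\max_{w_0, w_1, \Lambda} \; \log p(G \mid B, H) + \log p(B \mid \Lambda, \tau) + \log p(\Lambda), \tag{19}$$

where $\log p(G \mid B, H)$ is the log-likelihood of Equation (18) with H in Equation (11) and B in Equation (12), and $\log p(B \mid \Lambda, \tau) + \log p(\Lambda)$ is the logarithm of the horseshoe prior, which, up to an additive constant, is:

$$\sum_{i,j,k} \left[ -\frac{B_{i,j,k}^2}{2 \Lambda_{i,j,k}^2 \tau^2} - \log\left(\Lambda_{i,j,k} \tau\right) - \log\left(1 + \Lambda_{i,j,k}^2\right) \right].$$

We then use stochastic gradient descent libraries in the PyTorch framework to optimize Equation (19).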
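To make the two parts of the objective concrete, the following sketch evaluates a log-posterior on flat lists of values. It is an illustration under stated assumptions rather than the released implementation: additive constants are dropped, the Gaussian likelihood is assumed to have unit variance, and the names `horseshoe_log_prior`, `gaussian_log_likelihood` and `log_posterior` are hypothetical.

```python
import math

def horseshoe_log_prior(B, Lam, tau):
    """log p(B | Lam, tau) + log p(Lam), dropping additive constants.

    B and Lam are flat lists of equal length; entries of Lam must be > 0.
    """
    total = 0.0
    for b, lam in zip(B, Lam):
        total += -b * b / (2 * lam * lam * tau * tau)  # Gaussian exponent
        total += -math.log(lam * tau)                  # Gaussian normalizer
        total += -math.log(1 + lam * lam)              # half-Cauchy density
    return total

def gaussian_log_likelihood(scores, labels, sigma=1.0):
    """Sum of Gaussian log-densities of labels around scores (constants dropped).

    Assumption: a fixed standard deviation sigma (1.0 by default).
    """
    return sum(-(y - m) ** 2 / (2 * sigma ** 2) for m, y in zip(scores, labels))

def log_posterior(scores, labels, B, Lam, tau):
    """Quantity to maximize: log-likelihood plus log horseshoe prior."""
    return gaussian_log_likelihood(scores, labels) + horseshoe_log_prior(B, Lam, tau)

value = log_posterior(scores=[1.0, 0.0], labels=[1, 0], B=[0.0], Lam=[1.0], tau=1.0)
```

With a gradient-based optimizer, the negative of this value would serve as the loss; a small tau makes the Gaussian exponent penalize non-zero entries of B more strongly.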
We also consider two other variants of SPARSE: SPARSE O for not using any sparsity prior and SPARSE L for using Laplace prior (Lasso regularization), to examine the effect of using the horseshoe prior.

Experimental results
We validated SPARSE in two scenarios: synthetic data and real data. On the synthetic data, assuming that the data are generated from the latent interactions, we examined if SPARSE can recover the latent interactions under changing hyperparameters of data: the number of latent features, sparsity and amount of noise. On real data, we checked the prediction performance of SPARSE in comparison with state-of-the-art DDI prediction methods by using three real-world DDI datasets. Additionally, we evaluated if the top unknown predictions by SPARSE can be related to biological phenomena like functions and mechanisms.
For all experiments, we used 20-fold cross-validation, dividing the hyperedges into 20 folds and keeping the same number of hyperedges (side effects) in each fold. We report the mean and standard deviation of two commonly used measures, AUC and AUPR. All reported results are the highest performances found by grid search over hyperparameters. For SPARSE, the grid search covered three hyperparameters: (i) the latent feature size, with tested values 30, 40, 50 and 60 (the same size was used for all layers); (ii) the global sparsity $\tau$ of the horseshoe prior, with tested values 0.01, 0.02, 0.03, 0.05 and 0.1; and (iii) the number of neural layers, with tested values 1, 2 and 3. The selected values were 50 for the latent feature size, $\tau = 0.02$ for TWOSIDES and $\tau = 0.01$ for CADDDI and JADERDDI, and 2 for the number of neural layers. All experiments were run on a computer with an Intel Core i7-9700 CPU, an 8 GB GeForce RTX 2080 GPU and 32 GB of RAM.
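One way to build such folds is sketched below. It rests on our reading of "keeping the same number of hyperedges (side effects) in each fold", namely that the hyperedges of each side effect are spread as evenly as possible over the folds; this reading and the function name `split_folds` are assumptions.

```python
import random
from collections import defaultdict

def split_folds(hyperedges, num_folds=20, seed=0):
    """Split (u, v, t) hyperedges into folds, balancing each side effect t."""
    rng = random.Random(seed)
    by_side_effect = defaultdict(list)
    for u, v, t in hyperedges:
        by_side_effect[t].append((u, v, t))
    folds = [[] for _ in range(num_folds)]
    for t in sorted(by_side_effect):
        group = by_side_effect[t]
        rng.shuffle(group)
        # A random starting fold avoids always overfilling the first folds.
        offset = rng.randrange(num_folds)
        for idx, edge in enumerate(group):
            folds[(offset + idx) % num_folds].append(edge)
    return folds

toy_edges = [(u, v, t) for t in range(3) for u in range(5) for v in range(u + 1, 6)]
toy_folds = split_folds(toy_edges, num_folds=4)
```

Each fold then serves once as the test set while the remaining folds form the training set.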

Data generation
The generation process for synthetic data consists of two steps: (i) generating latent interactions and (ii) generating triples of interacting drug-drug-side effects from the latent interactions, as follows.
1. Generating latent interactions. Given the sets of indices of drug latent features $L_D = \{1, 2, \dots, K_D\}$ and side effect latent features $L_S = \{1, 2, \dots, K_S\}$:
   a. Initialize the set of latent interactions $A = \emptyset$.
   b. For each $k \in L_S$:
      1. Sample the number of drug latent feature pairs: $n_k = \mathrm{RandomInteger}(M)$, where M is the maximum number of pairs.
      2. Sample $n_k$ pairs $(i, j) \in L_D \times L_D$. For each pair (i, j): $A = A \cup \{(i, j, k)\}$.
2. Generating drug interactions:
   a. Generate drug and side effect latent features. Assume that there are $V_D$ drugs and $V_S$ side effects.
      1. For each drug $u \in V_D$: sample the number of drug latent features $n_u = \mathrm{RandomInteger}(N_1)$ and sample $n_u$ drug latent features from $L_D$.
      2. For each side effect $t \in V_S$: sample the number of side effect latent features $n_t = \mathrm{RandomInteger}(N_2)$ and sample $n_t$ side effect latent features from $L_S$.
   b. Collect the triples (u, v, t) that satisfy Equation (6). The final set of triples of drug-drug-side effects is E.
Finally, we have a synthetic dataset with triples of drug-drug-side effects E and drug feature vectors F.
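The generation steps can be sketched as follows. This is a simplified illustration, not the generator used in the experiments: RandomInteger(M) is assumed to be uniform on 1..M, latent features are sampled without replacement, a drug is not paired with itself, and the generation of the drug feature vectors F and of noise is omitted. All function names are hypothetical.

```python
import random

def generate_latent_interactions(K_D, K_S, M, rng):
    """Step 1: build the set A of interacting latent triples (i, j, k)."""
    A = set()
    for k in range(K_S):
        n_k = rng.randint(1, M)  # assumed: uniform on 1..M
        for _ in range(n_k):
            A.add((rng.randrange(K_D), rng.randrange(K_D), k))
    return A

def generate_triples(num_drugs, num_side_effects, K_D, K_S, N1, N2, A, rng):
    """Step 2: sample latent features, then collect triples allowed by A."""
    g_drug = [set(rng.sample(range(K_D), rng.randint(1, N1)))
              for _ in range(num_drugs)]
    g_side = [set(rng.sample(range(K_S), rng.randint(1, N2)))
              for _ in range(num_side_effects)]
    E = set()
    for i, j, k in A:
        drugs_i = [u for u in range(num_drugs) if i in g_drug[u]]
        drugs_j = [v for v in range(num_drugs) if j in g_drug[v]]
        effects_k = [t for t in range(num_side_effects) if k in g_side[t]]
        # A triple is generated when its latent features contain (i, j, k).
        E.update((u, v, t) for u in drugs_i for v in drugs_j
                 for t in effects_k if u != v)
    return E, g_drug, g_side

rng = random.Random(1)
A_toy = generate_latent_interactions(K_D=4, K_S=3, M=2, rng=rng)
E_toy, g_drug_toy, g_side_toy = generate_triples(8, 5, 4, 3, 2, 2, A_toy, rng)
```

Varying M, N1 and N2 changes how many triples are generated, which is how the sparsity of the synthetic data can be controlled.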

Experiments
The synthetic data has five hyperparameters: the number of drugs, the number of side effects, the number of latent interactions, data sparsity and the amount of noise (noise rate). We evaluated our methods by changing one hyperparameter, fixing the other four. The hyperparameters changed are (i) number of latent features, (ii) data sparsity and (iii) noise rate.
1) Changing the number of latent features. Setting: $V_D = 400$, $V_S = 300$, noise rate r = 0.01. We changed $K_D = K_S \in \{5, 10, 20, 30, 40, 50\}$. For each ($K_D$, $K_S$), we selected $N_1$, $N_2$ and M such that the sparsity of the generated data is kept at 0.98.
Results: Figure 2a shows the results, where SPARSE O achieved the highest performances among the compared methods in all cases. We had the following two findings:
1. For a small number of latent features, the performance of CentSmoothie was close to that of SPARSE O (both AUC and AUPR were around 0.99 under $K_D = K_S = 5$). However, as the number of latent features increased, the performance gap between SPARSE O and CentSmoothie also increased (gaps in AUC and AUPR were around 0.01 and 0.03, respectively, when $K_D = K_S = 50$). This result implies that CentSmoothie was unable to distinguish latent interactions clearly for a large number of latent interactions, while SPARSE O worked better for capturing multiple latent interactions.
2. The performances of SBM were lower than both CentSmoothie and SPARSE O , since SBM did not use the node features, which decreased the performance. HPNN, a similarity-based hypergraph neural network, had the lowest performance since the two drugs of a DDI do not necessarily have similarity in the data generated from latent interactions. Overall, these results indicated that SPARSE O can recover the latent interactions better than the other methods.
2) Changing the data sparsity.
Results: Figure 2b shows the results, where SPARSE achieved the highest performance, followed by SPARSE L and SPARSE O . In particular, the performance advantage of SPARSE using sparsity control was clearer with higher sparsity. These results indicate that the horseshoe prior is suitable for learning sparse data.
3) Changing the noise rate.
Compared methods: We again compared SPARSE with its two variants SPARSE L and SPARSE O to examine the effectiveness of the sparse priors in dealing with noise.
Results: Figure 2c shows the results, where again SPARSE achieved the highest performances among the three methods for all amounts of noise. When there was no noise, the performances of the three methods were very close to each other. However, as the amount of noise increased, the advantage of SPARSE over the other two methods became clearer. For example, when the amount of noise was 20%, the gap between SPARSE and SPARSE L reached around 0.07, and the gap between SPARSE and SPARSE O was around 0.1.
These results suggest that the horseshoe prior could deal with noise better than the Laplace prior and the case with no sparsity prior.

Data description
We used three real-world datasets for DDI, namely TWOSIDES (Tatonetti et al., 2012), CADDDI and JADERDDI. To our knowledge, TWOSIDES is the largest benchmark dataset for DDI. The other two datasets, i.e. CADDDI and JADERDDI, were generated from Canada Vigilance Adverse Reaction Reports and Japanese Adverse Drug Event Reports, respectively, in the same manner as TWOSIDES was generated from the adverse events reported to the US Food and Drug Administration (Nguyen et al., 2021). For all datasets, we only chose small molecular drugs, which can be found in DrugBank. Also, we focused on drugs appearing in more than five interactions (hyperedges) in each dataset. For each drug, we used a binary feature vector of size 2329, consisting of 881 substructures and 1448 interacting proteins. Table 1 shows summary statistics of the three real benchmark datasets, TWOSIDES, CADDDI and JADERDDI.

Predictive performance experiments
Compared methods: For our method, we used SPARSE and two variants SPARSE O and SPARSE L . We further used five methods as competing methods against SPARSE. These competing methods were CentSmoothie (Nguyen et al., 2021), the traditional similarity-based hypergraph neural network (HPNN) (Feng et al., 2019), two DDI graph-based graph neural networks: Decagon (Zitnik et al., 2018) and SpecConv (Kipf and Welling, 2016), and a molecular graph-based graph neural network, MRGNN (Xu et al., 2019). Decagon and CentSmoothie provide available codes, and we ran them with the recommended settings. For MLNN, MRGNN, SpecConv, HPNN and SBM, we implemented them and did a grid search for finding the best hyperparameter values.
Results-Cross-validation predictive performance:  HPNN. On the other hand, the performances of SpecConv, Decagon and MRGNN were significantly lower. Notably, SPARSE O (SPARSE without any sparsity prior) still achieved better performance than CentSmoothie, particularly in AUPR. There was only one case (CADDDI) where the AUC of SPARSE was slightly smaller than that of CentSmoothie. We then ran a t-test over the prediction results of these two methods to examine the significance of the difference between CentSmoothie and SPARSE. The resultant P-value of the t-test was 0.057, indicating that the performance advantage of CentSmoothie over SPARSE was not significant at the regular significance level of 0.05. Also, it has to be noted that AUPR is more useful than AUC for imbalanced data (Saito and Rehmsmeier, 2015), which are often seen in practice. We emphasize that DDI is a typical example of this situation. In fact, the AUPR performance gap between SPARSE O and CentSmoothie reached around 1%, 5% and 12% in TWOSIDES, CADDDI and JADERDDI, respectively. The performance gap in JADERDDI is especially sizable. This might be caused by the high sparsity of JADERDDI (see Table 1). These results suggest that the latent interaction assumption in SPARSE is more reasonable and suitable for DDI prediction than CentSmoothie and the other competing methods. Among SPARSE, SPARSE L and SPARSE O , SPARSE achieved the highest performance. Note that the performance gap between SPARSE and SPARSE L in AUPR became clearer for sparser data: e.g. only around 0.1% for TWOSIDES, while the gap reached around 1% for CADDDI and JADERDDI. Hence, we can see that with sparser data, the horseshoe prior had an advantage over the Laplace prior and also over the case with no sparsity prior.
Results-Unknown DDI prediction performance: We evaluated the predictive ability of unknown DDIs. That is, we first trained a model by using the whole TWOSIDES data (the largest dataset), then predicted the scores of unknown triples (drug-drug-side effect), and finally sorted the predicted triples in the descending order of the scores. We focused on the top 400 predictions of each method and checked the overlap with the DDIs stored in drugs.com (Drugs.com, 2021;Thelwall et al., 2017), a commonly used web checker for DDIs. Table 3 shows the number of overlaps between the DDIs in drugs.com and the top 400 predictions. SPARSE found 98 overlapped DDIs with drugs.com, this number being the highest and followed by CentSmoothie with 71 and HPNN with 48.
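The overlap count described above amounts to ranking the unknown triples and intersecting the top of the ranking with a reference set. A minimal sketch, with a hypothetical function name and toy inputs, and assuming that triples are matched exactly as (drug, drug, side effect) tuples:

```python
def top_k_overlap(scores, known, reference, k=400):
    """Count reference triples among the k highest-scoring unknown triples.

    scores:    dict mapping (u, v, t) to a predicted score
    known:     set of triples used for training (excluded from the ranking)
    reference: set of triples from an external source
    """
    unknown = [(s, e) for e, s in scores.items() if e not in known]
    unknown.sort(key=lambda pair: pair[0], reverse=True)
    top = [e for _, e in unknown[:k]]
    return sum(1 for e in top if e in reference)

toy_scores = {(0, 1, 0): 0.9, (0, 2, 0): 0.8, (1, 2, 0): 0.7, (0, 1, 1): 0.1}
overlap = top_k_overlap(toy_scores, known={(0, 1, 0)},
                        reference={(0, 2, 0), (0, 1, 1)}, k=2)
```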

Results-Case studies: interpretation of top 10 unknown predictions: SPARSE is an SBM with latent features for drugs, side effects, and interactions. In particular, the model has connections between latent drug features and latent interactions. Thus, from the trained model, we can extract the drug features that are most associated with each latent drug feature, and further extract the drug features most associated with each latent interaction through the corresponding latent drug feature. This means that we can retrieve the drug features of a DDI if we can connect the DDI with the latent interactions. Algorithm 1 shows the pseudocode of this procedure (with T = 20 in our cases). SPARSE is a sparse model, which allows only a limited number of latent interactions and, consequently, yields only a limited number of drug features. This is a sizable advantage of SPARSE for understanding the biological/chemical background behind predicted DDIs.
For the case studies, we extracted drug features (such as protein/pathway names) of the top unknown DDI predictions by using SPARSE trained on the entire TWOSIDES. Table 4 shows the top 10 predictions (out of the 400 predictions in the experiment of the previous section) with the observable features associated with latent drug features (fifth column from the right-hand side), the target protein of the corresponding drug according to DrugBank (sixth column) and the corresponding reference for each DDI (seventh column). In the fifth column, 'Not clear' means that, given our current understanding of the potential DDI mechanisms, we could not explain the corresponding low-level (molecular-level) background, although our algorithm could find associated drug features. The top predictions are likely to be similar to each other, since similar triples are likely to have similar scores. Indeed, the top predictions in Table 4 overlap substantially, but the table still supports the following four points:
1. The fourth and fifth predictions show cases where SPARSE could specify target proteins precisely, confirming the high credibility of these predictions and, more importantly, demonstrating the ability of SPARSE to detect unknown DDIs.
2. The first, second, third and sixth predictions show cases where SPARSE could identify possible interacting protein groups (fourth column), not necessarily directly associated with the drugs, indicating that SPARSE can suggest novel interactions as well as potential target proteins.
3. The validity of the seventh, eighth, ninth and tenth predictions might be understood from high-level views, such as the connection between vision and dizziness/sedation. This result implies that SPARSE can predict probable interactions that cannot be straightforwardly inferred from low-level data.
4. Overall, we could find relevant references for all top 10 predictions (Baldo, 2018; Fagiolini et al., 2004; Rho et al., 1997; Venkataraman et al., 2014), which supports the plausibility of these predictions and provides an additional layer of evidence for the usefulness of SPARSE in practical settings.
To facilitate medical research and the confirmation of our findings by subsequent clinical or preclinical studies, we provide the potential mechanisms for our top predictions as Supplementary Material. We also discuss below the main biological mechanism for one of the predicted top 10 interactions.
Naratriptan, Sertraline and abnormal ECG: Sertraline belongs to the selective serotonin reuptake inhibitor class of antidepressants. Members of this class inhibit the reuptake of the neurotransmitter serotonin into cells (Ritter et al., 2019). Through this inhibition, sertraline increases serotonin levels outside the cells and allows serotonin to remain longer at its site of action. Naratriptan is known to cause heart-related side effects through agonism at serotonin type 1 receptors (Dodick et al., 2004; Ritter et al., 2019). Therefore, the predicted side effect can be a direct consequence of sertraline increasing the level of endogenous serotonin and naratriptan acting at serotonin receptors in the heart, with the resulting changes visible in electrocardiogram recordings.
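The two-step feature extraction described in this section (from a latent interaction to its connected latent drug features, then to the top T observable drug features) can be illustrated with a minimal sketch. This is not Algorithm 1 itself: the matrices `B` and `W`, the threshold `eps`, the function names and the placeholder feature names are all assumptions introduced for illustration, and ranking by absolute weight is our own choice.

```python
def top_features(W, latent_idx, feature_names, T):
    """Names of the T observable features with the largest |weight| for one latent drug feature."""
    weights = W[latent_idx]
    order = sorted(range(len(weights)), key=lambda f: -abs(weights[f]))
    return [feature_names[f] for f in order[:T]]


def features_for_interaction(B, W, interaction_idx, feature_names, T, eps=1e-6):
    """Map each latent drug feature connected to a latent interaction to its top-T features.

    B[j][i]: hypothetical connection weight between latent interaction j and latent drug feature i.
    W[i][f]: hypothetical association weight between latent drug feature i and observable feature f.
    """
    result = {}
    for latent_idx, weight in enumerate(B[interaction_idx]):
        if abs(weight) > eps:  # a sparse model leaves few non-negligible connections
            result[latent_idx] = top_features(W, latent_idx, feature_names, T)
    return result


# Toy model (assumption): 2 latent interactions, 3 latent drug features, 4 observable features.
feature_names = ["protein_1", "protein_2", "pathway_1", "pathway_2"]
W = [
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.2, 0.0, 0.8],
    [0.3, 0.3, 0.3, 0.3],
]
B = [
    [0.7, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

extracted = features_for_interaction(B, W, interaction_idx=0, feature_names=feature_names, T=2)
print(extracted)
```

Because most entries of `B` are zero in this toy example, each latent interaction yields only a short list of observable features, which mirrors the interpretability benefit of sparsity discussed above.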

Conclusion and discussion
We have proposed SPARSE to learn the latent representations of drugs, side effects and interactions through hypergraph neural networks. SPARSE addresses three important issues of state-of-the-art DDI prediction that have not been addressed by any other method. Extensive empirical validation using both synthetic and real data showed that SPARSE outperformed all current, cutting-edge methods for DDI prediction, verifying the effectiveness of its assumption of multiple types of latent interactions and of its sparsity control.
One possible direction for future work is to generalize SPARSE to higher-order interactions among more than two drugs. Another interesting direction might be to apply SPARSE to other sparse, high-dimensional data in bioinformatics.